ABSTRACT
The unprecedented scale of genome sequencing during the SARS-CoV-2 pandemic created the challenge of processing the resulting genomic data. State-of-the-art phylogenetic methods, however, were mostly designed for much sparser data and require extensive subsampling of strains. We present (ε, τ)-MSN, a novel tool that reconstructs a viral genetic relatedness network from genetic distances and can process hundreds of thousands of sequences within several hours. We applied (ε, τ)-MSN to global COVID-19 outbreak data and built a genetic network of more than 100,000 SARS-CoV-2 sequences. We show that (ε, τ)-MSN can accurately detect transmission events and build a genetic network with significantly higher assortativity with respect to the continent and country attributes of SARS-CoV-2 samples. The source code for this software suite is available at https://github.com/Sergey-Knyazev/eMST. © 2021, Springer Nature Switzerland AG.